AD-A176 475


REPORT  DOCUMENTATION  PAGE 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


READ  INSTRUCTIONS 
BEFORE  COMPLETING  FORM 


2. GOVT ACCESSION NO.  3. RECIPIENT'S CATALOG NUMBER


5. TYPE OF REPORT & PERIOD COVERED

Final report; April 1, 1984 - Sept. 30, 1986


4. TITLE (and Subtitle)


"Distributed System Modelling and Analysis"


7. AUTHOR(s)

Nancy  A.  Lynch 


9. PERFORMING ORGANIZATION NAME AND ADDRESS

Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, MA 02139

11. CONTROLLING OFFICE NAME AND ADDRESS

U.  S.  Army  Research  Office 
Post  Office  Box  12211 


6.  PERFORMING  ORG.  REPORT  NUMBER 


8. CONTRACT OR GRANT NUMBER(s)


DAAG29-84-K-0058 


10. PROGRAM ELEMENT, PROJECT, TASK
AREA & WORK UNIT NUMBERS


12. REPORT DATE

December 22, 1986


13.  NUMBER  OF  PAGES 


14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)  15. SECURITY CLASS. (of this report)

Unclassified

15a. DECLASSIFICATION/DOWNGRADING
SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)

Approved  for  public  release;  distribution  unlimited. 


17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)


18. SUPPLEMENTARY NOTES

The view, opinions, and/or findings contained in this report are
those of the author(s) and should not be construed as an official
Department of the Army position, policy, or decision, unless so
designated by other documentation.


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

Distributed algorithms, lower bounds, distributed consensus, Byzantine agreement,
concurrency control, nested transactions.


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

This project has made considerable progress in developing theoretical foundations
for distributed computing. The primary thrust of the work has been the design of
distributed algorithms and the proof of upper and lower complexity bounds for
distributed problems. The kinds of problems studied include distributed
consensus in the presence of faults, resource allocation, and election of a leader.

A secondary effort has involved the development of formal semantic models for
distributed algorithms. A tertiary effort has involved the modelling, specification
and verification of concurrency control and recovery algorithms for nested
transaction systems.


EDITION OF 1 NOV 65 IS OBSOLETE


UNCLASSIFIED 

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


"Distributed  System  Modelling  and  Analysis" 


Final  Report 


Professor  Nancy  A.  Lynch 


December  23,  1986 


U.S.  ARMY  RESEARCH  OFFICE 


Contract  Number  DAAG29-84-K-0058 


Massachusetts  Institute  of  Technology 


APPROVED  FOR  PUBLIC  RELEASE; 
DISTRIBUTION  UNLIMITED. 


THE VIEW, OPINION, AND/OR FINDINGS CONTAINED IN THIS REPORT ARE THOSE OF THE
AUTHOR(S) AND SHOULD NOT BE CONSTRUED AS AN OFFICIAL DEPARTMENT OF THE
ARMY POSITION, POLICY, OR DECISION, UNLESS SO DESIGNATED BY OTHER
DOCUMENTATION.


TABLE OF CONTENTS


a. Statement of the Problem Studied


b.  Results 


I.  Analysis  of  Algorithms 

A.  Distributed  Consensus 

B.  Approximate  Agreement 

C.  Clock  Synchronization 

D.  Electing  a  Leader 

E.  Network  Resource  Allocation 

F.  Atomic  Registers 

G.  Other  Network  Problems 

II.  Models 

III.  Concurrency  Control  and  Recovery 

A.  Nested  Transactions 

B.  Highly  Available  Replicated  Data  Systems 

c.  Publications  and  Technical  Reports 

d.  Participating  Scientific  Personnel 

e.  Bibliography 



a.  Statement  of  the  Problem  Studied 

This  project  is  intended  to  develop  theoretical  foundations  for  distributed  computing.  The  primary  goal 
of the work has been the design of distributed algorithms and the proof of upper and lower complexity
bounds  for  interesting  distributed  problems.  The  kinds  of  problems  studied  include  distributed  consensus 
in  the  presence  of  faults,  resource  allocation,  and  election  of  a  leader. 

A  secondary  goal  has  been  the  development  of  formal  semantic  models  for  concurrent  and  distributed 
algorithms,  in  a  way  which  would  clarify  the  commonality  among  various  different  kinds  of  concurrent 
algorithms  (shared  memory  algorithms,  message-passing  algorithms,  concurrency  control  algorithms, 
dataflow algorithms, etc.).


A  tertiary  goal  has  involved  the  modelling,  specification  and  verification  of  concurrency  control  and 
recovery algorithms for nested transaction systems.

b. Results

I.  Analysis  of  Algorithms 

A.  Distributed  Consensus 

In [DLS], we devise algorithms for the problem of reaching agreement in a realistic distributed system
model that lies between the completely synchronous and completely asynchronous models. In this model,
messages have a "usual delivery bound", which need not always hold. In our solutions, disagreement can
never  be  reached,  no  matter  how  the  messages  behave.  Moreover,  if  the  messages  get  delivered  within 
their  usual  delivery  bound  for  a  sufficiently  long  interval  of  time,  then  agreement  is  guaranteed. 
Algorithms  are  given  for  various  fault  models,  along  with  matching  lower  bounds. 

The  first  version  of  [DLS]  included  separate  proofs  for  all  the  results.  In  preparing  a  journal  version  of 
[DLS], we discovered a better way of organizing the results. Namely, we discovered a natural abstract
partially  synchronous  model  which  can  be  used  to  present  the  algorithms,  and  a  collection  of  reductions 
which  allow  the  various  models  in  the  paper  to  simulate  the  abstract  model. 

In [BL,CDDS], we introduce and study a new and fundamental problem for distributed systems, which
we  call  the  Byzantine  Firing  Squad  Problem.  The  problem  is  for  remote  processes  to  manage  to  carry  out 
some  specified  action  at  the  same  time,  in  a  setting  where  the  processes  wake  up  at  different  times,  and 
where  some  of  the  processes  are  faulty.  We  obtain  algorithms  and  lower  bounds  for  a  variety  of  fault 
models.  Our  bounds  are  tight  for  all  but  one  of  the  models. 

In [FLM], we demonstrate a new technique for proving lower bounds on the number of processors
needed  to  solve  various  distributed  consensus  problems.  We  have  been  able  to  unify  a  large  collection  of 
previous  work  on  impossibility  for  various  kinds  of  distributed  consensus,  and  add  several  new  results, 
using  a  new  "shifting  scenarios"  technique.  Many  of  the  results  were  previously  known,  with  very 
complex  proofs.  There  are  some  new  results,  however,  in  the  area  of  clock  synchronization.  The  paper 
was  the  highest-rated  submission  to  the  1985  PODC  conference,  and  was  invited  to  appear  in  the  flagship 
issue  of  the  new  Springer-Verlag  Journal  on  Distributed  Computing. 

The work in [CMS] provides lower bounds for the expected time to reach Byzantine agreement in a
variety  of  fault  models. 

In [CC], we have developed a randomized Byzantine agreement algorithm that terminates in an expected
number of rounds that is smaller than the known lower bound (due to Lynch and Fischer) on the number
of rounds required by a deterministic Byzantine agreement algorithm. The algorithm is of interest for
several reasons. It is simple and efficient enough to be of practical importance. Also, it is an example of
a situation where randomization improves on the problem solving power of a system of computers.
Although randomization is vital to the algorithm, it is used sparingly: the expected number of coin tosses
per processor is less than one.

Brian Coan [Cl] has developed a two-step transformation of algorithms in various fault models
(fail-stop, failure-by-omission, and Byzantine) to a communication-efficient normal form. The first step is a
transformation into a well known communication-inefficient normal form in which each processor, at each
round, broadcasts its entire state. The second step is a new transformation from this
communication-inefficient form to a communication-efficient normal form. For each fault model there is a separate
transformation. The transformation in the Byzantine model is fully worked out, and more work is needed
for the other fault models.

As a corollary to the results in the Byzantine fault model, Brian obtained a major new result about the
communication requirements of Byzantine agreement. This new result is a polynomial-message Byzantine
agreement algorithm that uses about half the rounds of communication used by any other
polynomial-message algorithm.

The problem of achieving simultaneity in the presence of faults first appeared implicitly in work of
Rabin. He had an algorithm for reaching consensus whose expected running time was constant; however,
different processors might terminate at different rounds. The implicit question was: "Does there exist an
algorithm for achieving simultaneity that runs in time strictly less than O(t) (the lower bound for
agreement in a deterministic algorithm)?" We have answered this question negatively in [CD]: we have
shown that not only is there no fast deterministic algorithm for achieving simultaneity, but there is not
even a randomized algorithm whose expected running time is less than t+1 rounds, where the expectation
is taken over the coin flip sequences. These results only assume fail-stop faults, and therefore apply a
fortiori to more malicious failure models.

This work was continued by Dwork with Yoram Moses [DM]. They weakened the restrictions in [CD]
on the failure patterns for which the lower bound could be proved. In fact, they have actually been able
to completely characterize the time requirements (at least for certain consensus problems in which
simultaneous termination is required, and for stopping faults). That is, they are able to exhibit a simple
protocol that is optimal in the sense that it always halts at the earliest possible time, given the pattern in
which the processors fail. This is often much earlier than the best previously known protocol for this
problem.

This work was further continued by Yoram Moses and Mark Tuttle [MT]. In this continuation, they
apply the theory of knowledge in distributed systems to important problems in distributed computing.
Whereas the paper [DM] analyzes simultaneous Byzantine agreement in the crash failure model by
studying when facts become common knowledge, the present paper extends the previous paper to general
simultaneous actions in a variety of models of omission failures. These papers show that the major issue
in designing protocols for simultaneous actions in unreliable systems is the uncertainty that individual
processors have about other processors' views of the system. These papers demonstrate that a
knowledge-based analysis can provide substantial insight and improved protocols for such problems. In particular,
they show that it is possible to design protocols for simultaneous actions that in all of their runs will halt
at the earliest possible time, given the behavior of the system.

Coan  and  Lundelius  [CL]  have  studied  the  transaction  commit  problem  in  a  realistic  partially 
synchronous  computation  model.  Namely,  they  assume  that  message  delays  and  relative  processor  speeds 
are  unbounded,  and  processors  are  subject  to  stopping  faults.  The  time  behavior  of  the  system  during  an 
execution  influences  the  correctness  conditions  as  follows:  if  any  processor  votes  to  abort,  then  all 
processors  must  decide  to  abort;  if  all  processors  vote  to  commit  and  if  no  processors  fail  and  all  messages 
arrive  within  a  known  time  bound,  then  all  processors  must  decide  to  commit.  The  nonfaulty  processors 
must  always  agree  on  their  decision.  In  this  model,  they  describe  a  randomized  transaction  commit 
protocol based on Ben-Or's randomized asynchronous Byzantine agreement protocol. The expected
number  of  asynchronous  rounds  until  the  protocol  terminates  is  a  small  constant,  and  the  number  of 
stopping  faults  tolerated  is  optimal.  It  is  known  that  no  deterministic  protocol  is  possible  in  this  model. 
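
The correctness conditions stated above can be written down as a predicate over one finished execution. The following is a minimal sketch, not the protocol of [CL]: the function name, the argument layout, and the reading of "must decide to abort" as "no processor decides commit" are all assumptions made for illustration.

```python
def satisfies_commit_spec(votes, decisions, faulty=frozenset(), timely=True):
    """Check one execution against the three conditions in the text.

    votes:     dict processor -> "commit" or "abort"
    decisions: dict processor -> "commit", "abort", or None (never decided)
    faulty:    set of processors that stopped during the execution
    timely:    True if every message arrived within the known time bound
    (All names here are hypothetical.)
    """
    made = [d for d in decisions.values() if d is not None]

    # Any abort vote forces abort: nobody may decide commit.
    if "abort" in votes.values() and "commit" in made:
        return False

    # A failure-free, timely run with unanimous commit votes must commit.
    unanimous = all(v == "commit" for v in votes.values())
    if unanimous and not faulty and timely and "abort" in made:
        return False

    # Nonfaulty processors never disagree with one another.
    nonfaulty = {d for p, d in decisions.items()
                 if p not in faulty and d is not None}
    return len(nonfaulty) <= 1
```

Note that the predicate permits a unanimous-commit run to abort when messages are late, which is exactly how the timing behavior enters the specification.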

B.  Approximate  Agreement 

In [DLPSW], we give a new algorithm and matching lower bound for the problem of reaching
approximate  agreement  (for  example,  agreement  on  the  value  of  a  sensor)  among  processors  in  a 
distributed  network.  Interestingly,  the  problem  turns  out  to  be  considerably  easier  than  the  problem  of 
reaching  exact  agreement.  In  particular,  our  solutions  work  in  asynchronous  networks  with  faults, 
whereas  it  has  been  previously  shown  that  no  solution  to  exact  agreement  is  possible  in  such  a  network. 
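
To illustrate why approximate agreement is tractable, here is a minimal sketch of one synchronous round in which every correct processor discards the t lowest and t highest values it received and moves to the midpoint of the rest. This is a simplified illustration and not the algorithm of [DLPSW]; the function names and the adversary interface are assumptions.

```python
def trimmed_midpoint(values, t):
    """Drop the t smallest and t largest values, return the midpoint of the rest."""
    s = sorted(values)
    kept = s[t:len(s) - t]
    return (kept[0] + kept[-1]) / 2


def run_round(correct, t, adversary):
    """One round: each correct processor sees all correct values plus
    whatever the t faulty processors choose to send to it.

    adversary(i) returns the list of t values sent to correct processor i.
    """
    return [trimmed_midpoint(correct + adversary(i), t)
            for i in range(len(correct))]


def spread(values):
    return max(values) - min(values)


# Four processors, one faulty: it sends -100 to some receivers and +100 to others.
two_faced = lambda i: [-100.0] if i % 2 == 0 else [100.0]
start = [0.0, 4.0, 10.0]
after_one = run_round(start, 1, two_faced)
after_two = run_round(after_one, 1, two_faced)
```

In this example the spread of the correct values goes from 10 to 5 to 2.5 even though the faulty processor tells different processors different things, and every new value stays inside the range of the original correct values.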

A  version  of  [DLPSW]  was  prepared,  submitted  and  accepted  to  JACM.  New  results  were  obtained, 
showing how only 3t+1 processors suffice to reach approximate agreement in an asynchronous
environment  with  t  faults,  and  showing  how  faulty  processors  can  be  rendered  unable  to  determine  the 
worst-case  running  time  for  the  algorithm.  A  new  lower  bound  was  also  obtained  for  the  rate  at  which 
the  approximation  can  converge. 

Alan  Fekete  has  obtained  some  preliminary  results  which  show  how  the  theoretical  bound  on  rate  of 
convergence  can  be  approached  by  an  actual  algorithm. 

C.  Clock  Synchronization 

In [LuL1,LuL2,Lu1], we study the problem of synchronizing software clocks in a distributed system.

[LuL1] contains a new clock synchronization algorithm for use in a system in which some processors
exhibit worst-case (Byzantine) faults; it is able not only to maintain synchronization, but also to bring the
clocks into synchronization in the first place. Moreover, it enables easy recovery of failed processors.
[LuL2]  contains  a  surprising  lower  bound  on  the  closeness  with  which  clocks  can  be  synchronized  in  the 
presence  of  uncertainty  in  the  message  delivery  time.  The  lower  bound  is  shown  to  be  tight. 

Jennifer  Lundelius  implemented  a  slightly  modified  version  of  the  clock  synchronization  algorithm  from 
[LuL1] at AT&T Bell Laboratories this summer. The program was written in the C language and was
designed  to  synchronize  the  clocks  of  Suns  running  Berkeley  Unix  on  an  Ethernet.  The  algorithm  had  to 
be  modified  in  an  interesting  way  because  of  the  reality  of  the  Ethernet  —  it  does  not  provide  reliable, 
bounded  delay  communication  as  well  as  a  broadcast  primitive.  This  paper  describes  the  necessary 
modifications,  analyzes  the  worst-case  performance  of  the  new  algorithm,  and  gives  an  overview  of  the 
program. 

D.  Electing  a  Leader 

In  [FL],  we  study  the  communication  cost  of  the  very  important  problem  of  electing  a  leader  in  a 
network of processors. Our results are for the special but important case of a synchronous, bidirectional
ring  network.  They  show  that  any  algorithm  which  solves  this  problem  must  use  at  least  order  n  log  n 
messages.  An  interesting  combinatorial  technique  is  used. 
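
For comparison with the lower bound, the sketch below simulates a doubling-probe election on a bidirectional ring in the style of Hirschberg and Sinclair and counts the messages it uses. It is not taken from [FL]; the function names and the phase-by-phase simulation are assumptions made for illustration, and identifiers are assumed to be distinct.

```python
import math


def elect_leader(ids):
    """Return (leader id, message count) for a ring whose nodes hold `ids`."""
    n = len(ids)
    active = set(range(n))
    messages = 0
    k = 0
    while True:
        dist = 2 ** k                    # probe distance doubles each phase
        survivors = set()
        for i in active:
            wins = True
            for step in (+1, -1):        # probe clockwise and counterclockwise
                for d in range(1, dist + 1):
                    j = (i + step * d) % n
                    messages += 1        # one hop outbound
                    if j == i:           # probe travelled all the way round
                        return ids[i], messages
                    if ids[j] > ids[i]:  # a larger id swallows the probe
                        wins = False
                        break
                else:
                    messages += dist     # reply travels back to the sender
            if wins:
                survivors.add(i)
        active = survivors
        k += 1


def message_bound(n):
    """Standard O(n log n) upper bound for this style of algorithm."""
    return 8 * n * (1 + math.ceil(math.log2(n)))
```

A candidate survives a phase only if it holds the largest identifier within the current probe distance on both sides, so the number of candidates shrinks geometrically while the probe distance doubles; that balance is what keeps the total at order n log n.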

E.  Network  Resource  Allocation 

[FGGL,LGFG,FLBB] are papers about network resource allocation. Most of the results in these papers
were  obtained  a  couple  of  years  ago;  however,  we  have  been  settling  some  open  cases  and  polishing  up  the 
presentation. 

[LGFG]  was  finally  completed  and  sent  to  Information  and  Control.  Over  the  past  year  or  more,  we 
have  improved  this  work  in  many  ways.  The  most  recent  improvements  involve  generalizing  the  analysis 
to  allow  arbitrary  probability  distributions  of  request  arrivals,  and  to  the  case  where  resources,  as  well  as 
requests,  occur  at  locations  that  are  determined  probabilistically. 

[LGFG] contains criteria for optimal placements of resources in a network; work is still in progress on
this.

A new version of [FLBB] was prepared and submitted for publication.

F.  Atomic  Registers 

Vitanyi and Awerbuch [VA] have studied the feasibility of atomic shared register access by
asynchronous hardware. The problem is to construct multivalued registers which can be read and written
asynchronously by many processes in a consistent fashion. Moreover, it is required that any process
should be able to proceed without waiting for any other process. Using atomic 1-reader, 1-writer
registers, they have constructed atomic multireader, multiwriter registers using unbounded tags.
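
The role of unbounded tags can be illustrated with a small sequential sketch: each writer owns one slot, stamps its value with a tag larger than any it has seen, and readers return the value with the largest tag. This is only an illustration of the tagging idea under the assumption that every slot can be read by everyone; it is not the construction of [VA], and the class and method names are hypothetical.

```python
class TaggedRegister:
    """Multi-writer register simulated from one single-writer slot per writer."""

    def __init__(self, n_writers, initial=None):
        # slots[w] is written only by writer w and holds (tag, writer id, value).
        self.slots = [(0, w, initial) for w in range(n_writers)]

    def write(self, writer, value):
        # Read every slot, then pick a tag strictly above all tags seen.
        highest = max(tag for tag, _, _ in self.slots)
        self.slots[writer] = (highest + 1, writer, value)

    def read(self):
        # The latest write is the one with the largest (tag, writer id) pair.
        return max(self.slots, key=lambda s: (s[0], s[1]))[2]
```

Because tags only ever grow, they are unbounded; nothing in this sketch addresses interleaved operations or the no-waiting requirement.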

In [B], we give an algorithm allowing two processors with one-writer atomic registers with 2n values
each to simulate an n-value atomic register that they can both write. This is still work in progress.

G.  Other  Network  Problems 

[AG,AM,ACMG]  describe  new  results  on  some  interesting  network  problems.  [AG]  gives  an  efficient, 
though complicated, new algorithm for performing breadth-first search of a distributed network. [AM]
involves  development  of  a  new  algorithm  for  detecting  and  breaking  deadlocks  among  processes  in  a 
network. 

[AGMS] gives a new algorithm for carrying out a "global coin toss" in an unreliable network. This is a
very  fundamental  problem,  since  many  existing  protocols  rely  on  the  existence  of  such  a  coin.  They  have 
proposed  a  new,  efficient  protocol  that  produces  a  provably  fair  coin  in  the  presence  of  malicious 
adversaries.  Their  solution  uses  weak  cryptographic  assumptions. 

Paul Vitanyi has been working on a problem of distributed control [KV]. He has studied the number of
messages  required  for  matching  pairs  of  mobile  processes  in  a  multiprocessor  network;  this  is  a  measure 
for  the  cost  of  setting  up  temporary  communication  between  such  processes.  He  has  established  lower 
bounds  on  the  average  number  of  point-to-point  transmissions  between  any  pair  of  nodes  in  this  context. 
Applications  of  the  results  include  lower  bounds  on  the  number  of  messages  required  to  implement  a 
distributed  name-server,  and  to  solve  distributed  mutual  exclusion  and  distributed  resource  allocation 
problems. 

Coan  has  worked  on  limitations  on  database  availability  when  networks  partition  [CoOK].  In  designing 
fault-tolerant  distributed  database  systems,  a  frequent  goal  is  to  make  the  system  highly  available  despite 
component  failure.  They  describe  a  way  of  measuring  availability  and  prove  a  lower  bound  on  the 
availability  that  can  be  achieved  by  any  on-line  replicated  data  management  protocol  that  maintains 
database  consistency.  This  bound  holds  under  a  certain  uniformity  assumption  on  the  pattern  of  data 
accesses  by  transactions. 

II.  Models 

Gene  Stark's  PhD  thesis  [Si]  was  completed  during  this  reporting  period;  it  contains  a  formal 
foundation  for  a  theory  of  specification  of  modules  in  distributed  systems. 

Mark Tuttle has been working with Nancy Lynch [LT] on resource allocation algorithms and their
correctness proofs. We have been using a new "levels of abstraction" organization for proofs of
correctness of certain distributed algorithms. In particular, we have been applying it to prove the
correctness of a new design for a distributed arbiter algorithm. The proof organization [illegible in source] new
way of understanding distributed algorithms in terms of the "abstract knowledge" present at each node.

In order to carry out a clean, hierarchical proof, we have found it necessary to develop a clear formal
foundation for this work. A basic semantic model for concurrent computation has been defined; it is
based on a simple component which we call an I/O automaton. One important aspect of this model is the
division of process actions into input actions and output actions, which permits us to model the notion of
a "fair computation" easily. Our model captures the game-theoretic nature of distributed computation.
It  includes  treatment  of  both  finite  and  infinite  properties  of  module  behavior.  It  allows  organization  of 
algorithms  using  several  conceptual  levels  of  abstraction. 
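
The input/output split can be made concrete with a small sketch of a FIFO channel written as an automaton: inputs are accepted in every state, while an output may occur only when it is enabled. This is a simplified illustration, assuming a single component with no internal actions; the class name, method names, and the `fair_run` helper are hypothetical and are not part of the formal model described above.

```python
from collections import deque


class ChannelAutomaton:
    """FIFO channel with input action ("send", m) and output action ("receive", m)."""

    def __init__(self):
        self.queue = deque()

    def enabled_outputs(self):
        # Only the message at the head of the queue may be delivered.
        return [("receive", self.queue[0])] if self.queue else []

    def step(self, action):
        kind, msg = action
        if kind == "send":
            # Input actions are under the environment's control and are
            # always enabled: the automaton cannot refuse them.
            self.queue.append(msg)
        elif action in self.enabled_outputs():
            self.queue.popleft()
        else:
            raise ValueError(f"{action} is not enabled")


def fair_run(automaton, inputs):
    """Apply the inputs, then keep taking enabled outputs until none remain."""
    trace = []
    for action in inputs:
        automaton.step(action)
        trace.append(action)
    while automaton.enabled_outputs():
        action = automaton.enabled_outputs()[0]
        automaton.step(action)
        trace.append(action)
    return trace
```

In this sketch an execution is fair when no output stays enabled forever without being taken, which is why `fair_run` stops only once the channel is empty.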

The  I/O  automaton  model  has  been  applied  to  several  different  areas  of  concurrent  computing.  For 
example,  Jennifer  Lundelius  Welch  is  using  the  model  to  describe  shared  memory  algorithms.  In 
particular,  she  is  describing  a  well-known  n-process  mutual  exclusion  algorithm  of  Peterson  and  Fischer  in 
a  more  modular  way  than  they  do.  Two  advantages  are  gained.  First,  any  2-process  mutual  exclusion 
algorithm  can  be  used  as  a  subroutine  in  the  tournament  tree  in  her  formulation,  instead  of  just  Peterson 
and Fischer's 2-process solution. Second, the time performance can be reduced from O(n^2) to O(n log n).
An  important  aspect  of  this  work  is  the  development  (still  in  progress)  of  time  measures  for  asynchronous 
systems,  to  be  integrated  with  the  I/O  automaton  model. 
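
The modular structure described above, a tournament tree whose internal nodes are arbitrary 2-process locks, can be sketched as follows. This is an illustration only: it uses Peterson's later two-process algorithm as a stand-in subroutine rather than the Peterson and Fischer solution, it assumes memory operations appear in program order (as they do under CPython's interpreter lock), and all class and method names are hypothetical.

```python
import time


class Peterson2:
    """Two-process mutual exclusion; the two competitors are sides 0 and 1."""

    def __init__(self):
        self.flag = [False, False]
        self.turn = 0

    def acquire(self, side):
        other = 1 - side
        self.flag[side] = True
        self.turn = other
        while self.flag[other] and self.turn == other:
            time.sleep(0)  # busy-wait, yielding to other threads

    def release(self, side):
        self.flag[side] = False


class TournamentLock:
    """n-process lock built from any 2-process lock with acquire/release(side)."""

    def __init__(self, n, make_lock=Peterson2):
        self.leaves = 1
        while self.leaves < n:
            self.leaves *= 2
        # Heap layout: internal nodes are 1..leaves-1, process p sits at leaves+p.
        self.locks = {node: make_lock() for node in range(1, self.leaves)}

    def path(self, pid):
        """(node, side) pairs from the leaf's parent up to the root."""
        node = self.leaves + pid
        steps = []
        while node > 1:
            steps.append((node // 2, node % 2))
            node //= 2
        return steps

    def acquire(self, pid):
        for node, side in self.path(pid):
            self.locks[node].acquire(side)

    def release(self, pid):
        # Release from the root downward so that no two processes ever
        # compete on the same side of the same node.
        for node, side in reversed(self.path(pid)):
            self.locks[node].release(side)
```

Any object with the same `acquire(side)` and `release(side)` interface can be passed as `make_lock`, which is the modularity the text refers to.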

Another  application  of  the  model  has  been  to  produce  a  more  modular  description  of  a  family  of 
solutions  to  Chandy  and  Misra’s  Drinking  Philosophers  problem.  Unlike  their  original  presentation,  the 
new  description  produces  a  drinking  philosophers  algorithm  from  an  arbitrary  dining  philosophers 
algorithm  as  a  subroutine.  This  work  is  at  preliminary  stages;  it  has  not  yet  been  written  up. 

Also,  with  Dr.  Leslie  Lamport,  I  have  been  attempting  a  formal  proof,  in  levels  of  abstraction,  of  the 
very  well-known  minimum  spanning  tree  algorithm  of  Gallager,  Humblet  and  Spira.  This  has  met  with 
only  partial  success  so  far. 

The  principal  contribution  of  [DS]  is  the  introduction  of  a  new  type  of  reduction  designed  expressly  for 
distributed  systems.  This  reduction  classifies  distributed  problems  by  the  communication  requirements  of 
their  solutions. 

In [KS], we propose a new method for the analysis of cooperative and antagonistic properties of
communicating finite state processes (FSP's). This algebraic technique is based on a composition operator
and the notion of "possibility equivalence" among FSP's. We demonstrate its utility by showing that
potential blocking, lockout, and termination can be efficiently decided for loosely connected networks of
tree FSP's. If not all acyclic FSP's are trees, then the cooperative properties become NP-complete and the
antagonistic ones PSPACE-complete. We also have related results for tightly coupled networks and for
the considerably harder cyclic process case.

Lundelius has shown [L2] how a distributed system with synchronous processors and asynchronous
communication can be simulated by a system in which both processors and communication are
asynchronous, in the presence of various types of processor failures. Consequently, a result of Dolev,
Dwork and Stockmeyer, that no consensus protocol in a system with synchronous processors and
asynchronous communication can tolerate even one failstop processor, follows from the result of Fischer,
Lynch and Paterson, that fault-tolerant consensus is impossible when both processors and communication
are  asynchronous. 

III.  Concurrency  Control  and  Recovery 
A.  Nested  Transactions 

We  have  been  engaged  in  an  ambitious  project  to  provide  a  natural  formal  foundation  for  concurrency 
control  and  resiliency.  Our  goal  is  to  provide  a  framework  within  which  researchers,  developers  and 
implementers can discuss interesting requirements and algorithms for distributed transaction-processing
systems.  This  area  is  of  critical  importance  to  distributed  computing,  but  the  work  is  currently  described 
in  hundreds  of  unrelated  research  papers,  with  no  common  framework  to  aid  in  comprehension.  We  are 
especially  interested  in  a  theory  to  underlie  "nested  transactions",  an  important  new  language  construct 
for  distributed  computing. 

The paper [L1], on a preliminary model for nested transactions and a proof of correctness for an
exclusive  locking  algorithm,  was  revised  and  sent  back  to  Advances  in  Computing  Research  for  final 
publication. 

The paper [LM] contains the first reports on our results on a new, cleaner and more expressive model for
nested  transaction  concurrency  control  and  recovery.  This  framework  appears  to  be  satisfactory  for  all 
its  purposes.  The  paper  includes  a  statement  of  a  basic  correctness  condition  to  be  satisfied  by  all  nested 
transaction  systems.  It  also  contains  a  correctness  proof  for  an  exclusive  locking  algorithm.  Part  of  this 
effort  includes  description  of  the  implementation  of  data  objects  with  resiliency  properties,  in  terms  of 
basic data objects without such properties. The presentation and proofs are much simpler and give more
insight  than  previous  work. 

We have begun to extend this work in several different directions. With Alan Fekete, Michael Merritt
and Prof. Bill Weihl, we have proved correctness of an important practical algorithm, Moss' read-write
locking algorithm for nested transactions [FLMW]. This algorithm is currently implemented in the
ARGUS system.
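
The locking rules usually attributed to Moss can be sketched for a single object as follows: a read lock is granted if every holder of a write lock is an ancestor of the requester, a write lock is granted if every holder of any lock is an ancestor, a committing transaction passes its locks to its parent, and an aborting transaction's locks are discarded. This is a minimal sketch, not the formal description proved correct in [FLMW]; the class name and the encoding of transaction names as tuples (with the empty tuple as the root) are assumptions.

```python
class LockTable:
    """Read/write locks on one object, held by nested transactions."""

    def __init__(self):
        self.readers = set()
        self.writers = set()

    @staticmethod
    def is_ancestor(a, t):
        """True if a is t or an ancestor of t (names are tuples of child indices)."""
        return t[:len(a)] == a

    def acquire(self, t, mode):
        # Readers conflict with writers only; writers conflict with everyone.
        blockers = self.writers if mode == "read" else self.writers | self.readers
        if all(self.is_ancestor(h, t) for h in blockers):
            (self.readers if mode == "read" else self.writers).add(t)
            return True
        return False

    def commit(self, t):
        # Locks held by t are inherited by its parent.
        for held in (self.readers, self.writers):
            if t in held:
                held.discard(t)
                if t:                      # the root () has no parent
                    held.add(t[:-1])

    def abort(self, t):
        # Locks held by t and by its descendants are discarded.
        self.readers = {h for h in self.readers if not self.is_ancestor(t, h)}
        self.writers = {h for h in self.writers if not self.is_ancestor(t, h)}
```

A lock inherited all the way up to the root blocks nobody, since the root is an ancestor of every transaction.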

With Prof. Maurice Herlihy and Merritt and Weihl [HLMW], I have been working on
proving correctness of several algorithms for the detection and elimination of "orphan" transactions,
that is, transactions with ancestors that abort. If not managed properly, orphan transactions pose a danger of
causing damage, or wasting system resources, so several algorithms have been designed for managing
orphans in various systems. What has not been clear until now is exactly why these algorithms are
correct, or even what it means for them to be correct. We have been able to describe precise correctness
conditions within our model, and have shown correctness of two important orphan-management
algorithms. The proofs have been very clear, easy and short, in marked contrast to earlier attempts to
carry out such proofs. We are currently studying some other orphan-management algorithms, including
some that work in the presence of system node crashes which lose the contents of volatile memory.
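
The definition of an orphan can be stated directly over the transaction tree. The following is a minimal sketch of that definition only, not of any of the orphan-management algorithms in [HLMW]; the function name and the dictionary encoding of the tree are assumptions.

```python
def orphans(parent, aborted):
    """Return the transactions that have some proper ancestor in `aborted`.

    parent:  dict mapping each transaction to its parent (the root maps to None)
    aborted: set of transactions that have aborted
    """
    result = set()
    for t in parent:
        ancestor = parent[t]
        while ancestor is not None:
            if ancestor in aborted:
                result.add(t)
                break
            ancestor = parent[ancestor]
    return result


# T0 is the root; B and C run underneath A, D is unrelated.
tree = {"T0": None, "A": "T0", "B": "A", "C": "B", "D": "T0"}
```

For the tree above, aborting A makes B and C orphans while D is unaffected.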

With Ken Goldman, I have been working on modelling replicated data management algorithms [GL].
Here, we are interested in algorithms for managing replicated data in the presence of site and
communication failures (including network partitions). This work involves unifying the known results
(for non-nested transactions) and extending the results to nested transactions. All of this work has
proceeded very successfully, and papers are in various advanced stages of progress.

In somewhat less advanced stages of progress is work which is aimed toward a general theorem about
nested transactions and abstract objects, work on modelling crashes in distributed networks, and work on
timestamp-based algorithms for implementing nested transactions. All of these are important areas, and
we will continue this work in the future.

B.  Highly  Available  Replicated  Data  Systems 

Nancy Lynch's consulting work at CCA has led to several new research ideas. CCA is building a
distributed transaction processing system (SHARD) which is intended to work in a SAC environment, in
which communication is very unreliable. I have been involved in the system's design and specification. In
particular, I have helped design reliable broadcast algorithms for use with unreliable packet radio
communication [CLBK,SS], and algorithms to change system configuration during its operation [SL].

Most recently, I have been developing a set of correctness properties to describe the guarantees which
SHARD is able to make to its users [LBS]. Systems such as SHARD sacrifice strong correctness conditions
such as "serializability", in the interests of performance. In environments with unreliable communication,
it may be necessary to do this. It is important, however, to be able to make some precise guarantees
about what such systems can do. The guarantees made by the system include nonstop operation,
preservation of data consistency in case of nonfaulty communications, bounds of "costs" of inconsistencies
during certain kinds of faulty communications, and certain "fairness" guarantees. It is important to
make such guarantees explicit, so that the very novel approach embodied in the CCA system can be